p-values and confidence intervals

What you always wanted to know about statistical inference but were afraid to ask

Giusi Moffa
Bioinformatics group, University of Regensburg

The null ritual

  • The null ritual: what you always wanted to know about significance testing but were afraid to ask
    Gerd Gigerenzer, Stefan Krauss and Oliver Vitouch
    The Sage handbook of quantitative methodology for the social sciences, 2004
  • "It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail." (A.H.Maslow, 1966, pp. 15-16)
  • "Don’t use a hammer to swat a fly off someone’s head."
    John C Maxwell link
  •    

    Where do the misconceptions come from?

    • The earth is round (p < .05)
      Jacob Cohen, 1994, American Psychologist.

      "...this naked emperor has been shamelessly running around for a long time."

    • Statistical Inference: A Commentary for the Social and Behavioural Sciences.
      Michael Oakes, 1986, Wiley

    • Misinterpretations of Significance: A Problem Students Share with Their Teachers?
      Heiko Haller and Stefan Krauss
      Methods of Psychological Research Online 2002, Vol.7, No.1

      Two suspects: textbooks and statistics teachers
      Survey of the psychology departments of six German universities

    What does a significant result mean?

    What can be concluded from a significant result?

    The p-value questionnaire (Haller and Krauss, 2002)

    Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say 20 subjects in each sample). Further, suppose you use a simple independent means t-test and your result is (t = 2.7, d.f. = 18, p = 0.01). Please mark each of the statements below as “true” or “false.” False means that the statement does not follow logically from the above premises. Also note that several or none of the statements may be correct.


    Haller and Krauss asked 44 students, 39 lecturers and professors of psychology and 30 statistics teachers.

    1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).
      [T] [F]
    2. You have found the probability of the null hypothesis being true.
      [T] [F]
    3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).
      [T] [F]
    4. You can deduce the probability of the experimental hypothesis being true.
      [T] [F]
    5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
      [T] [F]
    6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.
      [T] [F]

    Why all statements are wrong

    Definition

    p-value: the probability of the observed data (or of more extreme data), given that the null hypothesis \(H_0\) is true; in symbols, \(p(D \vert H_0)\).
    • Statements 1 and 3: illusion of certainty
    • Statements 2 and 4: \(p(D \vert H_0) \neq p(H_0 \vert D)\) [wishful thinking]
    • Statement 5: again a probability of a hypothesis [prosecutor's fallacy, e.g. Sally Clark]
    • Statement 6: replication fallacy; \(1-p\) is not the probability that a replication yields a significant result
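A minimal simulation sketch of the replication fallacy (not from the slides): taking the observed effect of the questionnaire scenario at face value, the chance that a replication comes out significant is nowhere near 99%. The effect size, sample size and seed below are illustrative assumptions.

```python
# Replication fallacy: p = .01 does not mean 99% of replications would be
# significant. Simulation sketch; effect size, sample size and seed are
# illustrative assumptions, not data from the questionnaire study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 20                      # subjects per group, as in the questionnaire
d = 2.7 * np.sqrt(2 / n)    # standardised effect implied by t = 2.7 (about 0.85)

n_rep = 10_000
significant = 0
for _ in range(n_rep):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(d, 1.0, n)
    res = stats.ttest_ind(treated, control)
    significant += res.pvalue < 0.05

print(f"share of significant replications: {significant / n_rep:.2f}")
# roughly 0.75, far from the 99% suggested by statement 6
```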

    The psychologists' delusion about "p=.01"


    [Figure: percentage of participants in each group who endorsed at least one of the 6 statements.]


    A closer look at the single questions


    [Figure: percentage of false answers in the three groups.]


    Bayes Rule

    \[p(H_0 \vert D) = \frac{p(D \vert H_0) p(H_0)}{p(D \vert H_0)p(H_0) + p(D \vert H_1)p(H_1)}\]
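A small numerical sketch of Bayes' rule with purely hypothetical numbers, to make the point that \(p(D \vert H_0)\) and \(p(H_0 \vert D)\) can be very different; the likelihoods and the prior below are assumptions, not values from any study.

```python
# Bayes' rule with purely hypothetical numbers: p(D | H0) is not p(H0 | D).
p_D_given_H0 = 0.01   # probability of data this extreme under H0 (assumed)
p_D_given_H1 = 0.05   # probability of such data under the alternative (assumed)
p_H0 = 0.5            # prior probability of H0, equipoise (assumed)
p_H1 = 1 - p_H0

p_H0_given_D = (p_D_given_H0 * p_H0) / (p_D_given_H0 * p_H0 + p_D_given_H1 * p_H1)
print(round(p_H0_given_D, 3))   # about 0.167, not 0.01
```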

    Point hypotheses examples

    \[H_0: \theta = \theta_0 \mbox{ vs } H_1: \theta \neq \theta_0\]

    Difficulties

    • priors on the hypotheses? \[p(H_0), p(H_1)\]

    • non-trivial analysis, prior on parameter \[\pi(\theta)\]
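To illustrate why the analysis is non-trivial once a prior \(\pi(\theta)\) enters, here is a minimal sketch that computes \(p(D \vert H_1) = \int p(D \vert \theta)\,\pi(\theta)\,d\theta\) by numerical integration for a normal likelihood and a normal prior; the observed \(z\), the prior scale and the 1:1 prior odds are illustrative assumptions.

```python
# Point-null testing with a prior on theta under H1 requires the marginal
# likelihood p(D | H1) = integral of p(D | theta) * pi(theta) d(theta).
# Sketch with a normal likelihood and a normal prior; the observed z, the
# prior scale tau and the 1:1 prior odds are illustrative assumptions.
import numpy as np
from scipy import stats
from scipy.integrate import quad

z = 2.576    # observed standardised effect, roughly p = 0.01 two-sided
tau = 1.0    # prior standard deviation of theta under H1

def likelihood(theta):
    return stats.norm.pdf(z, loc=theta, scale=1.0)

def prior(theta):
    return stats.norm.pdf(theta, loc=0.0, scale=tau)

p_D_H0 = likelihood(0.0)                                     # point null
p_D_H1, _ = quad(lambda t: likelihood(t) * prior(t), -np.inf, np.inf)

bf_01 = p_D_H0 / p_D_H1
post_H0 = bf_01 / (1 + bf_01)        # posterior P(H0 | D) with prior odds 1:1
print(f"BF01 = {bf_01:.2f}, P(H0 | D) = {post_H0:.2f}")
```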

    A Bayesian perspective as correction

    • Living with P values. Resurrecting a Bayesian perspective on frequentist statistics.
      Sander Greenland and Charles Poole
      Epidemiology, 2013

    with discussion by Andrew Gelman: "The formal view of the P value as a probability conditional on the null is mathematically correct but typically irrelevant to research goals (hence, the popularity of alternative—if wrong—interpretations)"

    • Revised standards for statistical evidence.
      Valen E. Johnson
      PNAS, 2013

    with discussion by Gelman and Robert, Pericchi et al, Gaudart et al, 2014

    • Uniformly most powerful Bayesian tests.
      Valen E. Johnson
      Annals of Statistics, 2013

    Classical and Bayesian (mis)-interpretation

    "P values are here to stay"

    Gelman on Greenland and Poole

    Most common misinterpretation (frequentist)

    "A p_value is the probability of chance finding."

    The Independent on the Higgs 5-sigma result: "meaning that there is less than a one in a million chance that their results are a statistical fluke."

    Most common Bayesian interpretation: turn P values into Bayes factors

    \[\mbox{BF}_{01} = \frac{P(H_0 \vert D)/P(H_1 \vert D)}{P(H_0)/P(H_1)}\]

    a likelihood ratio for simple point hypotheses

    prior odds 50:50 and posterior odds 1:4 \(\Rightarrow \mbox{BF}_{01} = 1/4\)
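Spelling out the arithmetic behind the last line, from the definition of \(\mbox{BF}_{01}\) above:

\[\frac{P(H_0 \vert D)}{P(H_1 \vert D)} = \mbox{BF}_{01}\,\frac{P(H_0)}{P(H_1)}, \qquad \frac{P(H_0)}{P(H_1)} = \frac{.5}{.5} = 1, \quad \frac{P(H_0 \vert D)}{P(H_1 \vert D)} = \frac{1}{4} \;\Rightarrow\; \mbox{BF}_{01} = \frac{1}{4}\]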

    Spike and slab

    • Point probability prior \(q\) at \(\theta=\theta_0\)
    • Symmetric unimodal distribution around \(\theta_0\) for the rest
    • prior odds: \(q/(1-q)\)
    • P values \(P_0 = .10, .05, .01 \Rightarrow\) lower bounds on \(\mbox{BF}_{01}\) (over normal priors) of \(.63, .47, .15\)

    \[P_0=.05, q=1/2 \Rightarrow P(H_0 \vert D) \geq .47/(1+.47) = .32\]
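A minimal sketch reproducing the \(P_0 = .05\) case above, using the standard lower bound on \(\mbox{BF}_{01}\) over normal priors centred at \(\theta_0\) (the Edwards–Lindman–Savage bound \(\vert z \vert\, e^{(1-z^2)/2}\)); the choice of \(P_0\) is just the slide's example.

```python
# Lower bound on BF01 over normal priors centred at theta_0, i.e. the
# Edwards-Lindman-Savage bound |z| * exp((1 - z^2) / 2), and the implied
# bound on P(H0 | D) with spike weight q = 1/2 (prior odds 1).
# Sketch for the slide's P0 = .05 case only.
import numpy as np
from scipy import stats

p0 = 0.05
z = stats.norm.ppf(1 - p0 / 2)           # two-sided z-score, about 1.96

bf01_min = abs(z) * np.exp((1 - z**2) / 2)
post_H0_min = bf01_min / (1 + bf01_min)  # increasing in BF01, so also a lower bound

print(f"min BF01 = {bf01_min:.2f}, min P(H0 | D) = {post_H0_min:.2f}")
# about 0.47 and 0.32, matching the values on the slide
```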

    Conceptual difficulties

    • cannot be read from tables
    • require a choice of \(q\), an arbitrary commitment of probability mass to a single \(\theta\)

    Bayesian interpretations

    • Define conditions under which \(P\) values are indeed probability statements about the true parameter \(\theta_t\). (Deep waters of improper, weak and subjective priors - Gelman)

    Alternative interpretations of P values

    • measures of goodness of fit, distance, consistency, or compatibility between the observed data and the data-generating model

      • \(P_\theta\) probability transform of \(\vert \theta - \hat{\theta}\vert\). Small \(P_\theta \Rightarrow\) poor fit (but not limited to \(\theta \neq \theta_t\))
    • extreme-priors and weak priors (approximately true statements)

      • \(P_\theta\): posterior probability that \(\vert \hat{\theta} - \theta_t \vert > \vert \hat{\theta} - \theta \vert\)
      • \(P_\theta/2\): posterior probability that \(\hat{\theta}\) is on the wrong side of \(\theta\) with respect to \(\theta_t\) (see the sketch after this list)
    • informative priors

      • bounds on posterior probabilities with prior restricted to a given class
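A minimal sketch of the weak-prior reading of the one-sided P value mentioned above: with a normal model and a flat prior, \(P_\theta/2\) coincides with the posterior probability that the true \(\theta_t\) lies on the far side of \(\theta\) from the estimate; the estimate, standard error and tested value below are assumed numbers.

```python
# Weak-prior reading of the one-sided P value: with a normal model and a flat
# prior, P_theta / 2 equals the posterior probability that the true theta_t
# lies on the far side of theta from the estimate. The estimate, its standard
# error and the tested value below are assumed numbers.
from scipy import stats

theta_hat, se = 0.30, 0.15   # point estimate and standard error (assumed)
theta = 0.0                  # tested value, e.g. the null

# frequentist side: one-sided P value for testing theta
p_one_sided = 1 - stats.norm.cdf((theta_hat - theta) / se)

# Bayesian side: flat prior => posterior theta_t | D ~ N(theta_hat, se^2)
post_prob = stats.norm.cdf((theta - theta_hat) / se)   # P(theta_t < theta | D)

print(f"P/2 = {p_one_sided:.3f}, P(theta_t < theta | D) = {post_prob:.3f}")
# both about 0.023: the two quantities coincide in this weak-prior limit
```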

    Revised standards for statistical evidence

    Val Johnson, PNAS 2013, ahead of printing

    Motivation

    • "It ain't so much the things we don't know that get us into trouble. It's the things we know that just ain't so." Uncertain source.

    • Lack of reproducibility of scientific studies

    Due to significance thresholds that are too lenient?

    Two frameworks for statistical hypothesis testing:

    • Classical or frequentist, "significant" when test statistic exceeds a threshold
    • Bayesian, posterior odds (require an alternative)

    Uniformly most powerful Bayesian tests

    • Can P values and Bayes factors be calibrated to yield similar conclusions?
    • Connection between UMPT and UMPBT
    • By analogy with the Neyman–Pearson lemma, define a UMPBT such that
      • Given equipoise \(P(H_0)=P(H_1)=.5\)
      • Evidence threshold \(\gamma>0\)
      • For any \(\theta_g \in \Theta\) and every alternative \(H_{1'}: \theta \sim \pi_{1'}\), \[P_{\theta_g} (\mbox{BF}_{10}(x) > \gamma) \geq P_{\theta_g} (\mbox{BF}_{1'0}(x) > \gamma)\]
    • Correspondence between size of classical tests and evidence threshold
    • Rejection regions can be matched exactly
    • Decisions will match!
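A minimal sketch of the correspondence for the one-sided normal-mean case treated by Johnson (2013): under that construction the UMPBT(\(\gamma\)) rejection region is \(z > \sqrt{2\ln\gamma}\), so a classical test of size \(\alpha\) is matched by \(\gamma = \exp(z_\alpha^2/2)\); the \(\alpha\) values below are illustrative.

```python
# UMPBT sketch for the one-sided normal-mean test with known sigma, following
# Johnson (2013): the UMPBT(gamma) rejection region is {z > sqrt(2 ln gamma)},
# so a classical test of size alpha is matched by gamma = exp(z_alpha^2 / 2).
# The alpha values below are illustrative.
import numpy as np
from scipy import stats

def matching_gamma(alpha):
    """Evidence threshold gamma whose UMPBT rejection region has size alpha."""
    z_alpha = stats.norm.ppf(1 - alpha)
    return np.exp(z_alpha**2 / 2)

for alpha in (0.05, 0.01, 0.005):
    print(f"alpha = {alpha:<6} -> BF10 threshold gamma = {matching_gamma(alpha):.1f}")
# alpha = 0.05 maps to gamma of only about 3.9: a "significant" result
# corresponds to rather modest evidence against the null
```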

    P values vs UMPBT Bayes factors

  • Common thresholds correspond to only moderate evidence against the null
  • Correct the problem by raising evidence thresholds (i.e. lowering significance levels)
  • But beware: "Don’t use a hammer to swat a fly off someone’s head."
    Raising the bar does not come for free!
  • The same game with confidence intervals




    Weakly informative blog entries

    1. Misunderstanding the p-value
    2. Double Misunderstandings About p-values
    3. P-values and statistical practice
    4. Difficulties in making inferences about scientific truth from distributions of published p-values
    5. “Are all significant p-values created equal?”
    6. Forum in Ecology on p-values and model selection
    7. Revised evidence for statistical standards (Gelman, Xian)
    8. Revised statistical standards for evidence (comments to Val Johnson’s comments on our comments on Val’s comments on p-values)
    9. Jessica Tracy and Alec Beall (authors of the fertile-women-wear-pink study) comment on our Garden of Forking Paths paper, and I comment on their comments

    Statistically significant papers

    1. Forum—P Values and Model Selection, March 2014, Ecology special issue
    2. Revised standards for statistical evidence, Valen E. Johnson, PNAS 2013, ahead of printing
    3. An estimate of the science-wise false discovery rate and application to the top medical literature, Leah R. Jager and Jeffrey T. Leek, Biostatistics (2014) , with mixed discussions (e.g. Yoav Benjamini, David R. Cox, Andrew Gelman, John P. A. Ioannidis)
    4. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "\(p\)-hacking" and the research hypothesis was posited ahead of time, Gelman and Loken, 2013
    5. Too Good to Be True. Statistics may say that women wear red when they’re fertile … but you can’t always trust statistics.
    6. Why we (usually) don't need to worry about multiple comparisons, Gelman et al, 2012

    Some evidence that most published research is false

    1. I don’t believe the paper, “Empirical estimates suggest most published medical research is true.” That is, most published medical research may well be true, but I’m not at all convinced by the analysis being used to support this claim. (January 2013)
    2. Statistical significance and the dangerous lure of certainty (August 2013)
    3. Difficulties in making inferences about scientific truth from distributions of published p-values (September 2013)
    4. A summary of the evidence that most published research is false
    5. Is most science false? The titans weigh in
    6. The replication and criticism movement is not about suppressing speculative research; rather, it’s all about enabling science’s fabled self-correcting nature (February 2014)

    Water, water, every where,

    Nor any drop to drink

    The Rime of the Ancient Mariner, Samuel Taylor Coleridge, 1798

    Lies, damned lies and... big data. The rise and fall of statistics.